Titanic Survival Prediction

Machine Learning
AI Engineering
Classification
An attempt to predict Titanic passenger survival using classification techniques.
Author: DOSSEH Ameck Guy-Max Désiré

Published: September 5, 2025

Estimated reading time: ~30 minutes

Practice Project: Titanic Survival Prediction

Objectives

  • Build a classification pipeline with preprocessing and model selection
  • Tune hyperparameters via cross-validation
  • Compare Random Forest and Logistic Regression
  • Interpret results via reports, confusion matrices, and feature importance

Dataset and features

We use the Titanic dataset (via Seaborn). The goal is to predict survived from a set of demographic and ticket features. A sample of the dataset is shown below.

# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load dataset
titanic = sns.load_dataset('titanic')
titanic.head(10)

Titanic dataset head

Target: survived (0 = No, 1 = Yes)

Example features used:

  • Numerical: pclass, age, sibsp, parch, fare
  • Categorical/boolean: sex, class, who, adult_male, alone

Class balance (counts of the target) is shown below.

# Class counts (baseline check)
titanic['survived'].value_counts().sort_index()

Class balance
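
The class counts matter because they set the accuracy floor: a model that always predicts the majority class already reaches the majority share. The snippet below is a minimal sketch of that baseline on a hypothetical toy target (`y_toy` is made up for illustration); with the real data, the same two lines would be applied to `titanic['survived']`.

```python
import pandas as pd

# Hypothetical toy target; the real counts come from titanic['survived'].
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], name="survived")

counts = y_toy.value_counts().sort_index()

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = counts.max() / counts.sum()

print(counts.to_dict())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any tuned model should beat this number by a clear margin before its accuracy is considered meaningful.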

Modeling approach

  • Split train/test with stratification.
  • Preprocess using ColumnTransformer:
    • Numerical: impute median + standardize
    • Categorical: impute most_frequent + one-hot encode
  • Model 1: RandomForestClassifier with grid search over max_depth and min_samples_split
  • Model 2: LogisticRegression (liblinear solver) with grid search over penalty and class weights
  • Evaluate on the held-out test set.

# Feature selection and target
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'class', 'who', 'adult_male', 'alone']
target = 'survived'
X = titanic[features]
y = titanic[target]

# Train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=42
)

# Detect feature types
numerical_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

# Preprocessing pipelines
numerical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='median')),
  ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='most_frequent')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
  transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
  ]
)

Random Forest results

Classification report

# Random Forest pipeline and grid search
rf_pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('classifier', RandomForestClassifier(random_state=42))
])

rf_param_grid = {
  'classifier__n_estimators': [100],
  'classifier__max_depth': [None, 10, 20],
  'classifier__min_samples_split': [2, 5]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_model = GridSearchCV(
  estimator=rf_pipeline,
  param_grid=rf_param_grid,
  cv=cv,
  scoring='accuracy',
  verbose=0
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
print(classification_report(y_test, y_pred_rf))

Random Forest Classification Report

Confusion matrix

# Confusion matrix (RF)
conf_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(4.5, 4))
sns.heatmap(conf_rf, annot=True, cmap='Blues', fmt='d')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

Random Forest Confusion Matrix

Feature importances

# Feature importances (RF)
rf_best = rf_model.best_estimator_
rf_importances = rf_best.named_steps['classifier'].feature_importances_
ohe_feature_names = rf_best.named_steps['preprocessor'] \
  .named_transformers_['cat'] \
  .named_steps['onehot'] \
  .get_feature_names_out(categorical_features)
feature_names = numerical_features + list(ohe_feature_names)

imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': rf_importances}) \
  .sort_values(by='Importance', ascending=False)

top_n = min(25, len(imp_df))
plt.figure(figsize=(10, 8))
plt.barh(imp_df['Feature'].head(top_n)[::-1], imp_df['Importance'].head(top_n)[::-1], color='skyblue')
plt.title(f'Most Important Features (RF) — Test Acc: {rf_model.score(X_test, y_test):.2%}')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

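Random Forest Feature Importances

Notes:

  • Random Forest can capture non-linear relationships and feature interactions, and it handles a mix of numeric and one-hot encoded features well.
  • The importances rank numeric and one-hot encoded columns together, so the importance of a categorical feature is spread across its one-hot columns.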

Logistic Regression results

Classification report

# Logistic Regression pipeline and grid
lr_pipeline = Pipeline(steps=[
  ('preprocessor', preprocessor),
  ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

lr_param_grid = {
  'classifier__solver': ['liblinear'],
  'classifier__penalty': ['l1', 'l2'],
  'classifier__class_weight': [None, 'balanced']
}

lr_model = GridSearchCV(
  estimator=lr_pipeline,
  param_grid=lr_param_grid,
  cv=cv,
  scoring='accuracy',
  verbose=0
)

lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
print(classification_report(y_test, y_pred_lr))

Logistic Regression Classification Report

Confusion matrix

# Confusion matrix (LR)
conf_lr = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(4.5, 4))
sns.heatmap(conf_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

Logistic Regression Confusion Matrix

Coefficient magnitudes

# Coefficient magnitudes (LR)
lr_best = lr_model.best_estimator_
coefficients = lr_best.named_steps['classifier'].coef_[0]
ohe_feature_names_lr = lr_best.named_steps['preprocessor'] \
  .named_transformers_['cat'] \
  .named_steps['onehot'] \
  .get_feature_names_out(categorical_features)
feature_names_lr = numerical_features + list(ohe_feature_names_lr)

coef_df = pd.DataFrame({'Feature': feature_names_lr, 'Coefficient': coefficients}) \
  .sort_values(by='Coefficient', key=lambda s: s.abs(), ascending=False)

top_n = min(25, len(coef_df))
plt.figure(figsize=(10, 8))
plt.barh(coef_df['Feature'].head(top_n)[::-1], coef_df['Coefficient'].abs().head(top_n)[::-1], color='salmon')
plt.title(f'Logistic Regression Coefficient Magnitudes — Test Acc: {lr_model.score(X_test, y_test):.2%}')
plt.xlabel('Absolute Coefficient')
plt.tight_layout()
plt.show()

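Logistic Regression Coefficient Magnitudes

Notes:

  • Coefficients reflect each feature's linear contribution to the log-odds of survival, measured after preprocessing (standardized numeric features, one-hot encoded categories).
  • Coefficient magnitudes are not directly comparable to tree-based importances. The sign of a coefficient gives the direction of the effect; the plot shows absolute values, so direction has to be read from the signed values in coef_df.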
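
To make the point about direction concrete, here is a minimal sketch on synthetic data that keeps the sign of each coefficient and converts it to an odds ratio. The feature names `helps` and `hurts` and the data-generating weights are invented for the illustration. An odds ratio above 1 means the feature raises the odds of the positive class, and below 1 means it lowers them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic features: one raises the odds of class 1, the other lowers them.
X_toy = pd.DataFrame({
    "helps": rng.normal(size=n_rows),
    "hurts": rng.normal(size=n_rows),
})
logits = 2.0 * X_toy["helps"].to_numpy() - 2.0 * X_toy["hurts"].to_numpy()
y_toy = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Standardize so coefficients are per standard deviation of each feature.
X_scaled = StandardScaler().fit_transform(X_toy)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y_toy)

coef_table = pd.DataFrame({
    "Feature": X_toy.columns,
    "Coefficient": clf.coef_[0],          # signed: direction of the effect
    "OddsRatio": np.exp(clf.coef_[0]),    # multiplicative change in odds per 1 SD
}).sort_values(by="Coefficient", key=lambda s: s.abs(), ascending=False)

print(coef_table)
```

Plotting the signed `Coefficient` column instead of its absolute value would show direction and magnitude in one chart.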

Comparison and takeaways

  • Both models achieve comparable test accuracy on this dataset (the value for each model appears in its plot title).
  • Differences in feature rankings between the two models likely reflect correlated variables that carry overlapping information (e.g., sex, who_*, age).
  • Next steps: refine features, consider interaction terms, calibrate probabilities, and explore additional models.
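
A side-by-side table makes the comparison easier to read than two separate reports. The sketch below shows one possible layout using synthetic data from `make_classification` and deliberately small grids. The data, the grids (including tuning `C` for Logistic Regression), and the `summary` table are assumptions for illustration and do not reproduce the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the preprocessed Titanic features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Small illustrative grids, one search per model.
searches = {
    "Random Forest": GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=42),
        {"max_depth": [None, 5]}, cv=cv, scoring="accuracy",
    ),
    "Logistic Regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]}, cv=cv, scoring="accuracy",
    ),
}

rows = []
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    rows.append({
        "Model": name,
        "CV accuracy": search.best_score_,            # mean accuracy of the best grid point
        "Test accuracy": search.score(X_te, y_te),    # accuracy on the held-out split
        "Best params": search.best_params_,
    })

summary = pd.DataFrame(rows).set_index("Model")
print(summary)
```

Reporting the cross-validated score next to the test score also helps flag cases where a small test set makes one model look better by chance.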